Download the Ensembl Gene list data we’ll be using:
https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt
GeneID_Conversion_list.txt had 46 genes in the list
and 42 are converteddavid_id_conversion_results.tsv.
david_id_conversion.tsv in ExcelWe’re going to use the same Ensemble gene list as in the DAVID exercise:
https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt
UniProt Knowledgebase has two databases
Observations:
We’re going to use the same Ensembl gene list as in the DAVID exercise:
https://wd.cri.uic.edu/pathway/GeneID_Conversion_list.txt
Note: This will automatically load left panel with Filters and Attributes.
GeneID_Conversion_list.txt. Note we are using
Ensembl Gene IDs as our input so we do not need to change anything in
select identifier dropdown.Select Attributes: Click “Attributes” in the left panel. This will open “Attributes” panel in the middle.
mart_export.txt.gz file in the Downloads
folder.Note for larger query one can choose to get Email notification.
https://www.ebi.ac.uk/QuickGO/For this exercise we will highlight several different collections.
Review results
https://www.genome.jp/keggThe KEGG Pathway entry page has the following details for the pathway map04370.
The KEGG Pathway view, gives a graphical representation of the pathway. In the pathway map, you can observe the following.
In this view, you can click on any of the protein/enzymes, compounds or associated pathways to view the details for that entry.
Above the pathway map are the following links
Side Panel
Note the Change pathway type button. This opens the organism menu (list of organisms/genomes) listing those organisms that have at least some portion of this pathway.
This list displays the organism specific versions of the pathway that are available. The number to the right of the organism name denotes the fraction of genes present in that organism.
Download the RNA-seq data we’ll be using:
https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx
Next, we want to generate a list of differentially expressed genes:
Save these genes to a text file:
RNAseq_example.txt
https://wd.cri.uic.edu/pathway/RNAseq_example.txt
Navigate to DAVID https://david.ncifcrf.gov, and follow
the same steps as before to upload the new data set. First choose “Start
Analysis” from the top menu.
On the next page:
After uploading, click the “Functional Annotation Tool” link.
Note that DAVID has auto-detected the species from the ENSEMBL IDs. If we had uploaded gene symbols, we would need to select the correct species here.
We can choose which pathways to run enrichment against. DAVID has these organized into several categories. For today, we will enrich against Gene Ontology Biological Processes, and KEGG pathways (i.e., two databases of biological pathways).
Other databases here may be useful for other contexts, so it can be valuable to explore them. For example, for the other GO categories:
There are also pathways from Reactome available.
Select “Functional Annotation Chart” for the pathway enrichment statistics per term. Right-click on the Download File link to save as a file.
Open in Excel to view
A second result from DAVID is a functional annotation clustering, where it attempts to group terms that are similar into “clusters” of terms.
Select “Functional Annotation Clustering” to open this result. There are options to adjust the way that the clustering is performed. Again right-click on the Download File link to save as a file david_cluster.txt.
This can provide a useful way to parse a large list of similar terms. For example:
We’re going to use the same RNA-seq data set as in the DAVID exercise:
https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx
We will save the entire “example_diff” tab as a tab-delimited text file to upload to IPA
To upload to IPA, go to File > Upload Dataset, and browse to the text file made above. IPA will automatically parse the file in the next dialog:
After loading, IPA opens a annotated dataset view, which lets you know how IPA is interpreting all of the genes. We can close this window.
Pathway enrichment analysis is part of the “Core Analysis”. Other modules include enrichment against upstream regulators, disease biomarkers, toxicity functions, and others. IPA runs everything at once.
For today, we’ve run this data set previously, so we’ll skip to those results.
Browse to the Analyses folder and double-click a result to open the core analysis.
Start with the Canonical Pathways tab.
If you want more detail into the structure of a particular pathway, double-click it (twice). This will open an interactive image.
We will be using the same RNA-seq data set as before:
https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx
RNAseq_example.txt file from the DAVID
exerciseRNAseq_norm.txt)For consistency, we will obtain these data sets from RIC github repository. However, you can run the same exercise by saving tab-delimited text files from Excel and loading into R.
Open R Studio to perform this analysis, navigating to the folder
where you saved your RNAseq_example.txt and
RNAseq_norm.txt files.
# read in gene list from our server
degs <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_example.txt")
# remake as a vector
degs <- degs[,1]
# read in the antiviral gene list from RIC github repository
antiviral <- read.delim("https://wd.cri.uic.edu/pathway/antiviral_list.txt")
# remake as a vector
antiviral <- antiviral[,1]
head(antiviral)
## [1] "ENSMUSG00000042349" "ENSMUSG00000026896" "ENSMUSG00000027639"
## [4] "ENSMUSG00000028037" "ENSMUSG00000040296" "ENSMUSG00000039853"
# read in norm table from our server
norm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt",
row.names=1)
all_genes <- rownames(norm)
# check how big each list is
length(degs)
## [1] 2632
length(antiviral)
## [1] 43
length(all_genes)
## [1] 25931
# obtain true/false vectors for all genes based on intersection
# with degs or antiviral
antiviral_list <- all_genes %in% antiviral
degs_list <- all_genes %in% degs
# we can use table to confirm the number of genes in each
table(antiviral_list)
## antiviral_list
## FALSE TRUE
## 25888 43
table(degs_list)
## degs_list
## FALSE TRUE
## 23299 2632
# and we can use table to get the 2x2 contingency table
fet.table <- table(data.frame(antiviral_list, degs_list))
fet.table
## degs_list
## antiviral_list FALSE TRUE
## FALSE 23294 2594
## TRUE 5 38
# run fisher's exact test
fisher.test(fet.table)
##
## Fisher's Exact Test for Count Data
##
## data: fet.table
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 26.77896 223.26397
## sample estimates:
## odds ratio
## 68.31723
Antiviral genes are very strongly enriched: odds ratio is ~70x, p-value is very small.
Note: you may want to write a function to allow you do this calculation more rapidly. We will skip this for today, but here’s an example:
pathway_enrich <- function( degs, pathway, all_genes ){
pathway_list <- all_genes %in% pathway
degs_list <- all_genes %in% degs
fet.table <- table(data.frame(pathway_list, degs_list))
fet <- fisher.test(fet.table)
# return the odds ratio and p-value as a vector
result <- c(fet$p.value, fet$estimate)
names(result)[1] = "p.value"
return(result)
}
pathway_enrich(degs, antiviral, all_genes)
## p.value odds ratio
## 7.968619e-33 6.831723e+01
We will be using the same RNA-seq data set as before:
https://wd.cri.uic.edu/pathway/RNAseq_example.xlsx
Input .txt file prepared from the normalized expression table (“example_norm” tab). Again, for consistency we will obtain this file from RIC github repository. However, you can run the same exercise by saving tab-delimited text files from Excel and loading into R.
Open R Studio to prepare the input files. Use ENSEMBL IDs to go with GSEA’s built-in CHIP file for converting to gene symbols. Preparation steps:
colClasses parameter tells R
to ignore the columns named Gene.namenorm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt", row.names=1, colClasses=c(Gene.name="NULL"))
norm <- norm[rowMeans(norm) > 1,]
norm <- log2(norm + 0.1)
norm <- cbind(rownames(norm), rownames(norm), norm)
colnames(norm)[1:2] = c("NAME","DESCRIPTION")
write.table(norm, "RNAseq_for_gsea.txt", col.names=T, row.names=F, quote=F, sep="\t")
append=T parameter in
write.table to keep adding to the same file, and transpose vectors so
they’re written along the rows.groups <- c(rep("Control",4),rep("Infection",4))
write.table(t(c(8, 2, 1)), "RNAseq_for_gsea.cls", col.names=F, row.names=F)
write.table("# Control Infection", "RNAseq_for_gsea.cls", col.names=F, row.names=F,
append=T, quote=F)
write.table(t(groups), "RNAseq_for_gsea.cls", col.names=F, row.names=F, append=T, quote=F)
NOTE: you can also download these completed files if you have trouble with the above steps:
https://wd.cri.uic.edu/pathway/RNAseq_for_gsea.txt
https://wd.cri.uic.edu/pathway/RNAseq_for_gsea.cls
Go to the GenePattern cloud website:
https://cloud.genepattern.org/gp/pages/login.jsf
After logging in, enter “GSEA” under the Modules search box, and pick the first option (GSEA).
You can close your browser while the job runs. When you log back in, you can check the status or see results.
After results are finished, browse to the Jobs tab and click on the run name. Choose Download Job on the right.
https://wd.cri.uic.edu/pathway/gsea_out.zip
Unpack the zip folder, and find the main result file, index.html, double-click to open in a browser.
Within the detailed view:
We’re using chart results from an analysis on DAVID.
Open R Studio to perform this analysis:
# read in the results from DAVID
# the quote="" helps us to parse pathway names that have ' or " in them
david <- read.delim("https://wd.cri.uic.edu/pathway/david_chart.txt",
quote="")
# check how the names are interpreted in R
colnames(david)
## [1] "Category" "Term" "Count" "X."
## [5] "PValue" "Genes" "List.Total" "Pop.Hits"
## [9] "Pop.Total" "Fold.Enrichment" "Bonferroni" "Benjamini"
## [13] "FDR"
# sort by significance
david <- david[order(david$FDR),]
# for now, we'll focus on the KEGG pathways with FDR < 1%
david.kegg <- david[david$Category=="KEGG_PATHWAY" & david$FDR < 0.01,]
# log-scale the FDR and the enrichment ratio
david.kegg$logFDR <- -log10(david.kegg$FDR)
# the KEGG pathway names have IDs in them, fix it so that we just plot the name
head(david.kegg$Term)
## [1] "mmu04060:Cytokine-cytokine receptor interaction"
## [2] "mmu05140:Leishmaniasis"
## [3] "mmu04668:TNF signaling pathway"
## [4] "mmu05168:Herpes simplex infection"
## [5] "mmu04380:Osteoclast differentiation"
## [6] "mmu05162:Measles"
david.kegg$Term <- gsub("mmu[0-9]*:","",david.kegg$Term)
head(david.kegg$Term)
## [1] "Cytokine-cytokine receptor interaction"
## [2] "Leishmaniasis"
## [3] "TNF signaling pathway"
## [4] "Herpes simplex infection"
## [5] "Osteoclast differentiation"
## [6] "Measles"
# load the ggplot2 library
library(ggplot2)
NOTES FOR PLOT:
coord_flip() to plot the bars horizontallydavid.kegg$Term<-factor(david.kegg$Term, levels = rev(david.kegg$Term))
ggplot(data=david.kegg, aes(x=Term, y=logFDR)) + geom_col() + coord_flip() +
labs(y="-log10 FDR")
Example of same plot using the scales and a reverse log scale function
library(scales)
# We only really need to do this once, but this will allow us
# to quickly create a reverse log scale in ggplot2
revlog_trans <- trans_new("revlog",
function (x) { -log10(x) },
function (x) { 10 ^ (-x) }, log_breaks(base=10),
domain=c(1e-100, Inf))
ggplot(data=david.kegg, aes(x=Term, y=FDR)) + geom_col() + coord_flip() +
scale_y_continuous("FDR corrected p-value", trans=revlog_trans)
ggplot(data=david.kegg, aes(x=Term, y=Fold.Enrichment)) + geom_col() + coord_flip() +
labs(y="Fold Enrichment")
Compare pathway statistics for KEGG pathways in 3 different pathway enrichment results. These were generated on DAVID from three different gene lists generated from an RNA-seq clustering analysis.
For these files, we generated the Functional Annotation Chart in DAVID using a minimum of 1 molecule and a nominal p-value < 1, to include all terms in the comparison. We’ll filter for overall significance below.
Download the three result files, and combine together into a single data frame. This takes a bit of futzing:
kegg1 <- read.delim("https://wd.cri.uic.edu/pathway/cluster1_KEGG.txt",
quote="")
kegg2 <- read.delim("https://wd.cri.uic.edu/pathway/cluster2_KEGG.txt",
quote="")
kegg3 <- read.delim("https://wd.cri.uic.edu/pathway/cluster3_KEGG.txt",
quote="")
kegg1_subset <- kegg1[,c("Term", "FDR")]
kegg2_subset <- kegg2[,c("Term", "FDR")]
kegg3_subset <- kegg3[,c("Term", "FDR")]
colnames(kegg1_subset)[2] <- "Cluster1"
colnames(kegg2_subset)[2] <- "Cluster2"
colnames(kegg3_subset)[2] <- "Cluster3"
merge() to combine 2 data frames, but we
have 3 to combine.kegg_merged <- Reduce(function(x,y) merge(x=x, y=y, by="Term", all=T),
list(kegg1_subset, kegg2_subset, kegg3_subset))
head(kegg_merged)
## Term Cluster1 Cluster2
## 1 mmu00010:Glycolysis / Gluconeogenesis 1.0000000 0.9465988
## 2 mmu00020:Citrate cycle (TCA cycle) 0.9999999 NA
## 3 mmu00030:Pentose phosphate pathway 0.9999998 1.0000000
## 4 mmu00040:Pentose and glucuronate interconversions 1.0000000 1.0000000
## 5 mmu00051:Fructose and mannose metabolism 1.0000000 0.9999998
## 6 mmu00052:Galactose metabolism 1.0000000 1.0000000
## Cluster3
## 1 1.0000000
## 2 0.9995364
## 3 NA
## 4 1.0000000
## 5 1.0000000
## 6 1.0000000
kegg_merged[is.na(kegg_merged)] <- 1
kegg_df <- data.frame(kegg_merged)
kegg_subset <- kegg_df[ apply(kegg_df[,-1], 1, min) < 0.05, ]
kegg_subset$Term <- gsub("mmu[0-9]*:", "", kegg_subset$Term)
head(kegg_subset)
## Term Cluster1 Cluster2 Cluster3
## 15 Oxidative phosphorylation 1.349328e-03 3.585680e-08 0.999782250
## 18 Pyrimidine metabolism 9.422153e-01 3.476666e-02 0.009506058
## 81 Ribosome 6.922510e-10 2.964080e-16 1.000000000
## 87 DNA replication 7.539350e-06 2.945990e-04 0.963056903
## 88 Spliceosome 3.786304e-01 1.554310e-14 0.999992409
## 92 Base excision repair 1.000000e+00 9.999999e-01 0.002142762
We need to do a bit more reformatting to work with ggplot2, as it requires data in the long format to plot in side-by-side barplots.
kegg_subset2 <- kegg_subset[order(apply(kegg_subset[,-1], 1, min)), ]
tidyr package
pivot_longer to convert the data frame into
long format.library(tidyr)
head(kegg_subset2)
## Term Cluster1 Cluster2 Cluster3
## 81 Ribosome 6.922510e-10 2.96408e-16 1.000000000
## 88 Spliceosome 3.786304e-01 1.55431e-14 0.999992409
## 218 Parkinson's disease 2.707503e-02 1.23347e-10 0.981089407
## 220 Huntington's disease 6.105167e-02 6.58956e-10 0.851762867
## 271 Systemic lupus erythematosus 9.526860e-10 1.00000e+00 0.000195325
## 217 Alzheimer's disease 1.193811e-01 1.23151e-09 0.826030417
kegg_long <- pivot_longer(kegg_subset2, !Term, names_to = "cluster")
head(kegg_long)
## # A tibble: 6 × 3
## Term cluster value
## <chr> <chr> <dbl>
## 1 Ribosome Cluster1 6.92e-10
## 2 Ribosome Cluster2 2.96e-16
## 3 Ribosome Cluster3 1 e+ 0
## 4 Spliceosome Cluster1 3.79e- 1
## 5 Spliceosome Cluster2 1.55e-14
## 6 Spliceosome Cluster3 1.00e+ 0
fill=variable will color based on the different
clusters in the variable columnposition='dodge' will put the bars next to each
otherscale_x_discrete, To order the items using the
kegg_subset2 list.scale_y_continous, This will use our custom made
reverse log transformation for the scale.library(ggplot2)
library(scales)
ggplot(data=kegg_long, aes(x=Term, y=value, fill=cluster)) +
geom_col(position='dodge') + coord_flip() +
scale_x_discrete(limits=rev(kegg_subset2$Term)) +
scale_y_continuous("FDR corrected p-value", trans=revlog_trans)
geom_hline to create a
red line at an FDR value of 0.05.ggplot(data=kegg_long, aes(x=Term, y=value, fill=cluster)) +
geom_col(position='dodge') + coord_flip() +
scale_x_discrete(limits=rev(kegg_subset2$Term)) +
scale_y_continuous("FDR corrected p-value", trans=revlog_trans) +
geom_hline(yintercept = 0.05, color="red")
Also plot the prepared table as a heatmap.
library(circlize)
rownames(kegg_subset)<-kegg_subset$Term
kegg_subset$Term<-NULL
kegg_subset_revlog <- -log10(kegg_subset)
col_fun <- colorRamp2(c(0, -log10(0.05), max(unlist(kegg_subset_revlog))),
c("white","yellow","red"))
library(ComplexHeatmap)
heatmap <- Heatmap(as.matrix(kegg_subset_revlog), name="-log10 FDR", col=col_fun,
row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")
# We saw the breaks were at 0, 5, 10, and 15
breaks <- c(0, 5, 10, 15)
heatmap <- Heatmap(as.matrix(kegg_subset_revlog),
col=col_fun,
heatmap_legend_param = list(
at=breaks,
labels=10 ^ (-1 * breaks),
title="FDR"),
row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")
NOTE: some of these are repeats of steps we ran before. Pick the top KEGG pathway from our DAVID results and make a heatmap of the expression levels for genes in that pathway.
Use the results from DAVID from before, and the corresponding RNA-seq data set.
For consistency we will download these from RIC github repository. However, you can run the same exercise with your own DAVID results as well.
# read in the results from DAVID and RNA-seq normalized data
david <- read.delim("https://wd.cri.uic.edu/pathway/david_chart.txt",
quote="")
# sort by significance
david <- david[order(david$FDR),]
david.kegg <- david[david$Category=="KEGG_PATHWAY" & david$FDR < 0.01,]
# read in normalized expression
norm <- read.delim("https://wd.cri.uic.edu/pathway/RNAseq_norm.txt",
row.names=1, colClasses=c(Gene.name="NULL"))
Subset normalized expression to the genes in the top KEGG pathway:
# name of the top pathway
top.kegg.name <- david.kegg[1,"Term"]
# remove the KEGG ID
top.kegg.name <- gsub("mmu[0-9]*:","",top.kegg.name)
# get the gene list for the top pathway
# first remove the white space
top.kegg.genes <- gsub(" ", "", david.kegg[1,"Genes"])
# use string split to split the list into a vector by commas
# use unlist to turn it back into a vector
top.kegg.genes <- unlist(strsplit(top.kegg.genes,","))
length(top.kegg.genes)
## [1] 94
# subset the norm table to these genes
top.kegg.norm <- norm[top.kegg.genes,]
dim(top.kegg.norm)
## [1] 94 8
# log-scale and z-score
top.kegg.norm <- log2(top.kegg.norm + 0.1)
top.kegg.norm <- t(scale(t(top.kegg.norm)))
# plot in a heatmap
library(ComplexHeatmap)
Heatmap(top.kegg.norm, name = top.kegg.name, show_row_names = FALSE)
This is a result from variant calling in a whole-exome variant calling data set (one sample, human). We’ll do a pathway analysis of mutated genes, using two different filtering strategies:
NOTES:
SIFT (Sorting Interlant From Tolerant) is a model for predicting the effects of amino acid substitutions on protein function, based on sequence homology and known physical properties of amino acids. The SIFT model scores amino acid substitutions on a range from 0 to 1, where 0 is biggest effect, 1 is least effect. Damaging effects are predicted for scores < 0.05.
The filtering strategies outlined below are only a couple examples of how to filter variants. Other options include combining the strategies (damaging + mutation burden), filtering also based on the genotype call (heterozygous vs homozygous), using a different threshold for mutation burden, including other prediction models of damage to function (polyPhen, phyloP, etc.), and looking at known information about the variants in databases such as dbSNP, 1000 Genomes/ExAC/gnomAD, COSMIC (for cancer mutations), ClinVar, etc.
First, read in the variant table in R studio. It is available as a tab-delimited text file.
vars <- read.delim("https://wd.cri.uic.edu/pathway/example_variants.txt")
# look at vars in the R studio variable browser, or with head
head(vars)
## CHROM POS REF ALT Genotype Func Gene ExonicFunc
## 1 chr1 69270 A G 1/1 exonic OR4F5 synonymous_SNV
## 2 chr1 69511 A G 1/1 exonic OR4F5 nonsynonymous_SNV
## 3 chr1 930248 G A 0/1 exonic SAMD11 nonsynonymous_SNV
## 4 chr1 952421 A G 1/1 exonic NOC2L synonymous_SNV
## 5 chr1 953259 T C 1/1 exonic NOC2L synonymous_SNV
## 6 chr1 953279 T C 1/1 exonic NOC2L nonsynonymous_SNV
## AAChange SIFT_score SIFT_pred
## 1 OR4F5:ENST00000335137.3:exon1:c.A180G:p.S60S . .
## 2 OR4F5:ENST00000335137.3:exon1:c.A421G:p.T141A 0.652 T
## 3 SAMD11:ENST00000342066.7:exon3:c.G166A:p.G56S 0.038 D
## 4 NOC2L:ENST00000327044.6:exon10:c.T1182C:p.T394T . .
## 5 NOC2L:ENST00000327044.6:exon9:c.A918G:p.E306E . .
## 6 NOC2L:ENST00000327044.6:exon9:c.A898G:p.I300V 1.0 T
# filtering strategy 1: damaging effects
# variants with SIFT prediction D for damaging, just the Gene column
filter1 <- as.character(vars[vars$SIFT_pred=="D","Gene"])
# some gene annotations have a comma-separated list of genes
# strsplit will split them, and unlist will give it all as a vector
filter1 <- unlist(strsplit(filter1,","))
filter1 <- unique(filter1)
head(filter1)
## [1] "SAMD11" "PERM1" "ATAD3B" "SLC35E2" "CEP104" "CA6"
length(filter1)
## [1] 806
# filtering strategy 2: mutation burden
# remove variants with synonymous or unknown change to translated sequence
filter2_all <- as.character(vars[vars$ExonicFunc!="synonymous_SNV" &
vars$ExonicFunc!="unknown","Gene"])
# split by commas again
filter2_all <- unlist(strsplit(filter2_all,","))
# generate a count per gene using the table function
filter2_table <- table(filter2_all)
head(filter2_table)
## filter2_all
## A1BG A2ML1 A4GALT A4GNT AACS AADACL2
## 1 2 1 1 1 1
# get the gene names from the table with counts bigger than 3
filter2 <- names(filter2_table)[filter2_table > 3]
head(filter2)
## [1] "ABCA13" "ACAN" "ADGRF2" "ADGRV1" "AHNAK" "AHNAK2"
length(filter2)
## [1] 245
# check the number of genes in common between the lists
length(intersect(filter1, filter2))
## [1] 122
# write both lists to tables
write.table(filter1,"filter_damaging.txt",col.names=F,row.names=F,quote=F)
write.table(filter2,"filter_mutation_load.txt",col.names=F,row.names=F,quote=F)
NOTE: these gene lists are available from our server as well:
https://wd.cri.uic.edu/pathway/filter_damaging.txthttps://wd.cri.uic.edu/pathway/filter_mutation_load.txtNOTE: the DAVID portion of this exercise may be left as homework, depending on workshop timing. The visualization parts of the exercise can be completed using DAVID results we have already generated.
Go to DAVID https://david.ncifcrf.gov, and follow
similar steps as before:
In the chart view, note which dataset we’re looking at. Also modify the filters:
Prepare a heatmap comparison of the two results in R Studio. For consistency, we will use the results stored on our server, but you can read your own results into R also.
# read in tables; remember quote="" helps with parsing
# pathway names with quotes in their name
gobp.damaging <- read.delim(
"https://wd.cri.uic.edu/pathway/GOBP_damaging.txt",
quote="")
gobp.mutation_load <- read.delim(
"https://wd.cri.uic.edu/pathway/GOBP_mutation_load.txt",
quote="")
# process as we did before:
# subset the columns
sub.damaging <- gobp.damaging[,c("Term","FDR")]
sub.mutation_load <- gobp.mutation_load[,c("Term","FDR")]
# name the FDR column based on the gene list
colnames(sub.damaging)[2] <- "Damaging.Variants"
colnames(sub.mutation_load)[2] <- "Mutation.Load"
# merge and replace missing values with FDR = 1
gobp.merged <- merge(x=sub.damaging, y=sub.mutation_load, by="Term", all=T)
gobp.merged[is.na(gobp.merged)] <- 1
# prepare as a data frame with term IDs in the rownames
gobp.df <- data.frame(gobp.merged[,c(2:ncol(gobp.merged))])
rownames(gobp.df) <- gobp.merged[,1]
# subset to the significant terms and log-scale FDRs
gobp.subset <- gobp.df[apply(gobp.df, 1, min) < 0.05, ]
gobp.subset <- -log10(gobp.subset)
# remove the GO IDs from the term descriptions
rownames(gobp.subset) <- gsub("GO:[0-9]*~","",rownames(gobp.subset))
# plot in a heatmap, setting a color scale first
library(circlize)
col_fun <- colorRamp2(c(0, -log10(0.05), max(unlist(gobp.subset))),
c("white","yellow","red"))
library(ComplexHeatmap)
heatmap <- Heatmap(as.matrix(gobp.subset), name="-log10 FDR", col=col_fun,
row_names_gp = gpar(fontsize = 5))
draw(heatmap, heatmap_legend_side = "left")
Download antiviral gene list.
https://wd.cri.uic.edu/pathway/antiviral_list.txt
NOTE: If STRING asks you to select an organism, type in “Mus musculus”.
Download the miRBase IDs
https://wd.cri.uic.edu/pathway/miR-ids.txt
mir-ids.txt